statfandomcom-20200213-history
Blending
Blending is a combination of two or more models in order to create a model which is better than any of them. Since different models have different weak and strong sides, blending may significantly improve performance. Blending of many (10-1000) models is necessary to achive the best possible performance in contests. In real world problems, blending is often very useful, but the number of models is not that big. For instance, for the Netflix prize contest, even the first year winning submission was a blend of 107 different models. For the real world, Netflix selected two models: Matrix Factorization and Restricted Boltzmann Machines. Their RMSE were 0.89 and 0.90 correspondently, the blend's RMSE was 0.88 . Note that RMSE of the best model was 0.8567. This was achieved by blending probably hundreds of models http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html. Definitions Let * y_i is the dependent variable for the observation i * \hat y_{mi} is the prediction for y_i by the model m * \hat y_i is the final prediction for y_i Linear regression If there are M models, and the prediction of the model m for observation i is \hat p_{mi} , then the total prediction is : \hat y_i = \sum_i \beta_m \hat y_{mi} The vector \beta are obtained via linear regression. It is important that the observation used for calculating \beta and the observations used for training the models themselves will be non-intersecting. This is because most models perform better on the training set than on the test set, therefore if we use the training set (or part of it) for estimating \beta , the coefficients will be wrong. Even estimating the metaparameters might overfit the model, therefore it may be useful to divide the set of observations into 3 parts: * training set, for training the models * validation set 1, for estimating metaparameters of the model (i. e. learning ratio) * validation set 2, for estimating \beta . Lasso or ridge regression When there are many models, it may be useful to use lasso, ridge regression, stepwise elimination, or other relaxation methods for \beta . See the linear regression article for more details. The metaparameters of relaxation methods are obtained via cross-validation of the validation set 2. Generalized Additive Model For generalized additive model (GAM), target function is a sum of one-dimensional functions: : \hat y_i = \alpha + \sum_m f_m\left(\hat y_{mi}\right)\! Algorithm: estimate f_m one by one in a loop. Each time estimate the residual \mathbf{y}-\sum_{l\ne m} \hat f_l(\mathbf{\hat y_l})\! as a smooth spline and subtract its average (so that the average of f_m will be zero). Alternatively, we can calibrate each model separately: : \hat y^{(final)}_{mi} = f_m\left(\hat y^{(raw)}_{mi}\right) As in the previous section, the validation set 2 should be used for calibrations and parameter estimations. Neural networks Sometimes, neural network is used to estimate the final prediction : \hat y_{mi} = \mathrm{NeuralNet}\left(\hat y_{1i} , ... , \hat y_{Mi}\right) Dependence on parameters Different models performs better on different parts of the input space. The linear model can be upgraded as: : \hat y_i = \sum_i \beta_m(\mathbf{x_i}) \hat y_{mi} where \mathbf{x_i} are explanatory variables. For recommender engines at the Netflix contents, the following was useful: : \hat r_{ui} = \sum_m \left( \beta_m + \beta^\prime_m \log (1+S_u) + \beta^{\prime\prime}_m \log (1+S_i) \right) \hat r^{(m)}_{ui} , where : \hat r_{ui} is the final prediction of the rating for user u, item i. : \hat r^{(m)}_{ui} is the model m's prediction of the rating for user u, item i. : S_u is the support for the user u, that is, number of items rated by him. : S_i is the support for the item, that is, number of users who rated it. Liang Sun's models Oracle blending The oracle blending model is for use in competitions. In order to describe it, recall that the ridge regression is given by the formula: : \beta = \left(\hat Y^T \hat Y + \lambda I \right)^{-1} \hat Y y , where * \beta are oracle weights * \hat Y_{mi} is the prediction of the model m for the observation i * y_i is the value of the dependent variable for the observation i * \lambda is the ridge regression parameter If we can obtain RMSE of any vector of predictions on the test set, can we use it to estimate \beta ? On the test set, we know the covariance matrix Y^T Y . As for \hat Y y , we have: : \left( \hat Y y \right)_m = \sum_i \hat y_{mi} y_i = \frac{1}{2} \left( \lVert \hat y_m - y \rVert^2 - \lVert \hat y_m \rVert^2 - \lVert y \rVert^2 \right) , where \hat y_m = \left( \hat y_{m1} , ... , \hat y_{mn} \right) , n is the number of observations The first term can be found as: : \lVert \hat y_m - y \rVert^2 = n \cdot \operatorname{RMSE}(\hat y_m)^2 Similarly, : \lVert y \rVert^2 = n \cdot \operatorname{RMSE}(0)^2 , where \operatorname{RMSE}(0) means RMSE of the submission containing zeros in all positions. Finally, \lVert \hat y_m \rVert^2 is known. Therefore, : \left( \hat Y y \right)_m = \frac{1}{2} \left( n \cdot \operatorname{RMSE}(\hat y_m)^2 - n \cdot \operatorname{RMSE}(0)^2 - \lVert y \rVert^2 \right) Therefore, we can find \beta on the test set. Bagged oracle blending For bagged oracle blending, we take N random subsets of the initial set of oracles, each subset contains k_0 oracles, N and k_0 are model parameters. For each subset, we determine weights via oracle blending. Finally, we take average weights of all N subsets. This procedure reduces overfitting. Constrained oracle blending Constrained oracle blending imposes constraints on both weights and errors on the validation set: : \min_\beta f(\beta) = \lVert y - \hat Y \beta \rVert^2 subject to: : -w_0 \le \beta_m \le w_0 for m=1,...,M : \lVert y^{(v)} - \hat Y^{(v)} \beta \rVert^2 \le t where : y^{(v)} is the vector of dependent variables on the validation set, and \hat Y^{(v)} is the matrix of model predictions, also on the validation set. : w_0 and t are parameters of the model These constraints do not let any (potentially overfit) model become too big. Finding the weights is a quadratic optimization problem.